Sir:

Prior to examination of the above-identified application, Applicant respectfully 
requests that the following amendments be entered into the application: 

IN THE CLAIMS : 

Please amend the claims as follows: 
Claim 3, line 2, delete "or 2".
Claim 4, line 1, delete "one";
line 2, delete "of claims 1 to 3" and insert --claim 1--.
Claim 7, line 2, delete "or 6".
Claim 19, line 2, delete "or 18".
Claim 20, line 1, delete "one";
line 2, delete "of claims 17 to 19" and insert --claim 17--.
Please add the following new claims:

--25. The voice recognition system according to claim 2, wherein the converter
executes the elongation or contraction of spectrum on frequency axis with warping 
function defining the form of elongation or contraction by carrying out the elongation or 
contraction in cepstrum space. 

26. The voice recognition system according to claim 2, wherein the 
elongation/contraction estimating unit executes the elongation or contraction of 
spectrum on frequency axis with warping function defining the form of elongation or 
contraction by using estimation derived from the best likelihood estimation of HMM 
(hidden Markov model) in cepstrum space.

27. The voice recognition system according to claim 3, wherein the 
elongation/contraction estimating unit executes the elongation or contraction of 
spectrum on frequency axis with warping function defining the form of elongation or 
contraction by using estimation derived from the best likelihood estimation of HMM 
(hidden Markov model) in cepstrum space.

28. The reference pattern learning system according to claim 6, wherein the 
elongation/contraction estimating unit executes the elongation or contraction of 
spectrum on frequency axis with warping function defining the form of elongation or 
contraction by using estimation derived from the best likelihood estimation of HMM 
(hidden Markov model) in cepstrum space.

29. The voice recognition method according to claim 18, wherein the 
elongation or contraction of spectrum on frequency axis with warping function defining 
the form of elongation or contraction is executed by carrying out the elongation or 
contraction in cepstrum space. 

30. The voice recognition method according to claim 18, wherein the 
elongation/contraction estimating process executes the elongation or contraction of 
spectrum on frequency axis with warping function defining the form of elongation or 
contraction by using estimation derived from the best likelihood estimation of HMM 
(hidden Markov model) in cepstrum space.






Atty. Dkt. No. 071671/0156 



31. The voice recognition method according to claim 19, wherein the
elongation/contraction estimating process executes the elongation or contraction of 
spectrum on frequency axis with warping function defining the form of elongation or 
contraction by using estimation derived from the best likelihood estimation of HMM 
(hidden Markov model) in cepstrum space.--



Applicants respectfully request that the foregoing amendments to Claims 3-4, 7
and 19-20 and new Claims 25-31 be entered in order to avoid this application incurring 
a surcharge for the presence of one or more multiple dependent claims. 
SPEAKER'S VOICE RECOGNITION SYSTEM, METHOD AND RECORDING
MEDIUM 

BACKGROUND OF THE INVENTION

The present invention relates to an indefinite speaker's voice
recognition system and method, as well as an acoustic model learning
method and a recording medium with a voice recognition program recorded
therein and, more particularly, to a voice recognition system capable of
normalizing speakers on the frequency axis, a learning system for
normalization, a voice recognition method, a learning method for
normalization, and a recording medium in which a program for voice
recognition and a learning program for normalization are stored.

Spectrum converters in prior art voice recognition systems are
disclosed in, for instance, Japanese Patent Laid-Open No. 6-214596
(referred to as Literature 1) and Puming Zhan and Martin Westphal,
"Speaker Normalization Based on Frequency Warping", ICASSP, 1039-1042,
1997 (referred to as Literature 2).

For example, Literature 1 discloses a voice recognition system, which
comprises a frequency correcting means for correcting the frequency
characteristic of an input voice signal on the basis of a plurality of
predetermined different frequency characteristic correction
coefficients, a frequency axis converting means for converting the
frequency axis of the input voice signal on the basis of a plurality of
predetermined frequency axis conversion coefficients, a feature quantity
extracting means for extracting the feature quantity of the input voice
signal as input voice feature quantity, a reference voice storing means
for storing a reference voice feature quantity, a frequency
characteristic correcting means, a frequency axis converting means, a
collating means for collating the input voice feature quantity obtained
as a result of processes in the frequency characteristic correcting
means and the reference voice feature quantity stored in the reference
voice storing means, a speaker adaptation phase function and a voice
recognition phase function being included in the voice recognition
system. In the voice recognition process in this system, in the speaker
adaptation phase an unknown speaker's voice signal having a known
content is processed in the frequency characteristic correcting means,
frequency axis converting means and feature quantity extracting means
for each of the plurality of different frequency characteristic
correction coefficients and the plurality of different frequency axis
conversion coefficients, the input voice feature quantity for each
coefficient and a reference voice feature quantity of the same content
as the above known content are collated with each other, and a frequency
characteristic correction coefficient and a frequency axis conversion
coefficient giving a minimum distance are selected. In the voice
recognition phase, the input voice feature quantity is determined by
using the selected frequency characteristic correction coefficient and
frequency axis conversion coefficient and collated with the reference
voice feature quantity.

In these prior art voice recognition systems, to improve the
recognition performance, the spectrum converter elongates or contracts
the spectrum of the voice signal on the frequency axis according to the
sex, age, physical conditions, etc. of the individual speakers. For
spectrum elongation and contraction on the frequency axis, a function
is defined whose outline can be varied with an adequate parameter, and
this function is used for elongation or contraction of the spectrum of
the voice signal on the frequency axis. The function which is used for
elongating or contracting the spectrum of the voice signal on the
frequency axis is referred to as "warping function", and the parameter
for defining the outline of the warping function is referred to as
"elongation/contraction parameter".

Heretofore, a plurality of warping parameter values are prepared as
the elongation/contraction parameter of the warping function ("warping
parameter"), the spectrum of the voice signal is elongated or contracted
on the frequency axis by using each of these values, an input pattern is
calculated by using the elongated or contracted spectrum and used
together with a reference pattern to obtain a distance, and the value
corresponding to the minimum distance is set as the warping parameter
value at the time of the recognition.

The spectrum converter in the prior art voice recognition system will
now be described with reference to the drawings. Fig. 9 is a view
showing an example of the construction of the spectrum converter in the
prior art voice recognition system. Referring to Fig. 9, this spectrum
converter in the prior art comprises an FFT (Fast Fourier Transform)
unit 301, an elongation/contraction parameter memory 302, a frequency
converter 303, an input pattern calculating unit 304, a matching unit
306, a reference pattern unit 305 and an elongation/contraction
parameter selecting unit 307. The FFT unit 301 cuts out the input voice
signal for every unit interval of time and executes Fourier transform of
the cut-out signal to obtain a frequency spectrum.

A plurality of elongation/contraction parameter values for
determining the elongation or contraction of frequency are stored in the
elongation/contraction parameter memory 302. The frequency converter 303
executes a frequency elongation/contraction process on the spectrum fed
out from the FFT unit 301 using a warping function with the outline
thereof determined by the elongation/contraction parameter, and feeds
out the spectrum obtained after the frequency elongation/contraction
process as the elongation/contraction spectrum. The input pattern
calculating unit 304 calculates and outputs an input pattern by using
the elongation/contraction spectrum fed out from the frequency converter
303. The input pattern represents, for instance, a parameter time series
representing an acoustical feature such as cepstrum.

The reference pattern is formed by using a large number of input
patterns and averaging phoneme unit input patterns belonging to the same
class by a certain type of averaging means. For the preparation of the
reference pattern, see "Fundamentals of Voice Recognition", Part I,
translated and edited by Yoshii, NTT Advanced Technology Co., Ltd.,
1995, p. 63 (Literature 3).

Reference patterns can be classified by the recognition algorithm.
For example, time series reference patterns with input patterns arranged
in the phoneme time series order are obtainable in the case of DP
(Dynamic Programming) matching, and state series and connection data
thereof are obtainable in the HMM (hidden Markov model) case.

The matching unit 306 calculates the distance by using the reference
pattern 305 matched to the content of the voice inputted to the FFT unit
301 and the input pattern. The calculated distance corresponds to the
likelihood in the HMM (hidden Markov model) case concerning the
reference pattern and to the distance of the optimum route in the DP
matching case. The elongation/contraction parameter selecting unit 307
selects the best matched elongation/contraction parameter in view of the
matching property obtained in the matching unit 306.

Fig. 10 is a flow chart for describing a process executed in a prior
art spectrum matching unit. The operation of the prior art spectrum
matching unit will now be described with reference to Figs. 9 and 10.
The FFT unit 301 executes the FFT operation on the voice signal to
obtain the spectrum thereof (step D101 in Fig. 10). The frequency
converter 303 executes elongation or contraction of the spectrum on the
frequency axis by using the input elongation/contraction parameter
(D106) (step D102). The input pattern calculating unit 304 calculates
the input pattern by using the frequency axis elongated or contracted
spectrum (step D103). The matching unit 306 determines the distance
between the reference pattern (D107) and the input pattern (step D104).
The sequence of processes from step D101 to step D104 is executed for
all the elongation/contraction parameter values stored in the
elongation/contraction parameter memory 302 (step D105).

When 10 elongation/contraction parameter values are stored in the
elongation/contraction parameter memory 302, the process sequence from
step D101 to D104 is repeated 10 times to obtain 10 different distances.
The elongation/contraction parameter selecting unit 307 compares the
distances corresponding to all the elongation/contraction parameters,
and selects the elongation/contraction parameter corresponding to the
shortest distance (step D108).
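
The exhaustive search of steps D101 to D108 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of Literature 1 or 2: the names `select_parameter`, `make_pattern` and `distance`, and the toy pattern and distance functions in the usage lines, are hypothetical and introduced here for illustration.

```python
def select_parameter(signal, reference, candidates, make_pattern, distance):
    """Try every stored parameter value and keep the one with the shortest distance.

    make_pattern(signal, alpha) stands in for steps D101 to D103 (FFT, frequency
    elongation/contraction, input pattern calculation); distance(pattern, reference)
    stands in for step D104. Both are hypothetical callables supplied by the caller.
    """
    best_alpha = None
    best_distance = float("inf")
    evaluations = 0
    for alpha in candidates:                    # step D105: repeat for every stored value
        pattern = make_pattern(signal, alpha)   # steps D101 to D103
        d = distance(pattern, reference)        # step D104
        evaluations += 1
        if d < best_distance:                   # step D108: keep the shortest distance
            best_alpha, best_distance = alpha, d
    return best_alpha, best_distance, evaluations


# Toy stand-ins (assumptions for illustration only): the "pattern" is the signal
# scaled by alpha, and the distance is a sum of squared differences.
def toy_pattern(signal, alpha):
    return [alpha * x for x in signal]


def toy_distance(pattern, reference):
    return sum((p - r) ** 2 for p, r in zip(pattern, reference))


stored_values = [0.8, 0.9, 1.0, 1.1, 1.2]
alpha, dist, n = select_parameter([1.0, 2.0, 3.0], [1.1, 2.2, 3.3],
                                  stored_values, toy_pattern, toy_distance)
print(alpha, n)
```

The number of pattern calculations equals the number of stored values, and the selected value can never lie between two stored values.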

However, the above prior art spectrum converter has the following
problems.

The first problem is that increased computational effort is required
in the elongation/contraction parameter value determination. This is so
because in the prior art spectrum converter it is necessary to prepare a
plurality of elongation/contraction parameter values and to execute the
FFT process, the spectrum frequency elongation/contraction process and
the input pattern calculation repeatedly, a number of times
corresponding to the number of these values.

The second problem is that the frequency elongation and contraction
may fail to give a sufficient effect on the voice recognition system.
This is so because the elongation/contraction parameter values are all
predetermined, and none of these values may be optimum for an unknown
speaker.

SUMMARY OF THE INVENTION

The present invention was made in view of the above problems, and its
main object is to provide a voice recognition system and method, and
also a recording medium, which permit calculation of the optimum
elongation/contraction parameter value for each speaker with less
computational effort and can thus improve performance. The above and
other objects and features of the present invention will become apparent
from the following description.

According to a first aspect of the present invention, there is
provided a voice recognition system comprising a spectrum converter for
elongating or contracting the spectrum of a voice signal on the
frequency axis, the spectrum converter including: an analyzer for
converting an input voice signal to an input pattern including cepstrum;
a reference pattern memory with reference patterns stored therein; an
elongation/contraction estimating unit for outputting an
elongation/contraction parameter in the frequency axis direction by
using the input pattern and the reference patterns; and a converter for
converting the input pattern by using the elongation/contraction
parameter.

According to a second aspect of the present invention, there is
provided a voice recognition system comprising: an analyzer for
converting an input voice signal to an input pattern including a
cepstrum; a reference pattern memory for storing reference patterns; an
elongation/contraction estimating unit for outputting an
elongation/contraction parameter in the frequency axis direction by
using the input pattern and reference patterns; a converter for
converting the input pattern by using the elongation/contraction
parameter; and a matching unit for computing the distances between the
elongated or contracted input pattern fed out from the converter and the
reference patterns and outputting the reference pattern corresponding to
the shortest distance as result of recognition.

The converter executes the elongation or contraction of the spectrum
on the frequency axis, with a warping function defining the form of
elongation or contraction, by carrying out the elongation or contraction
in cepstrum space. The elongation/contraction estimating unit executes
the elongation or contraction of the spectrum on the frequency axis,
with the warping function defining the form of elongation or
contraction, by using estimation derived from the best likelihood
estimation of HMM (hidden Markov model) in cepstrum space.

According to a third aspect of the present invention, there is
provided a reference pattern learning system comprising: a learning
voice memory with learning voice data stored therein; an analyzer for
receiving a learning voice signal from the learning voice memory and
converting the learning voice signal to an input pattern including
cepstrum; a reference pattern memory with reference patterns stored
therein; an elongation/contraction estimating unit for outputting an
elongation/contraction parameter in the frequency axis direction by
using the input pattern and the reference patterns; a converter for
converting the input pattern by using the elongation/contraction
parameter; a reference pattern estimating unit for updating the
reference patterns stored in the reference pattern memory for the
learning voice data by using the elongated or contracted input pattern
fed out from the converter and the reference patterns; and a likelihood
judging unit for monitoring distance changes by computing distances by
using the elongated or contracted input pattern fed out from the
converter and the reference patterns.

The converter executes the elongation or contraction of the spectrum
on the frequency axis, with a warping function defining the form of
elongation or contraction, by carrying out the elongation or contraction
in cepstrum space. The elongation/contraction estimating unit executes
the elongation or contraction of the spectrum on the frequency axis,
with the warping function defining the form of elongation or
contraction, by using estimation derived from the best likelihood
estimation of HMM (hidden Markov model) in cepstrum space.

According to a fourth aspect of the present invention, there is
provided a voice quality converting system comprising: an analyzer for
converting an input voice signal to an input pattern including a
cepstrum; a reference pattern memory for storing reference patterns; an
elongation/contraction estimating unit for outputting an
elongation/contraction parameter in the frequency axis direction by
using the input pattern and reference patterns; a converter for
converting the input pattern by using the elongation/contraction
parameter; and an inverse converter for outputting a signal waveform in
time domain by inversely converting the time serial input pattern
obtained after the elongation/contraction supplied from the converter.

According to a fifth aspect of the present invention, there is
provided a recording medium for a computer constituting a spectrum
converter by executing elongation or contraction of the spectrum of a
voice signal on the frequency axis, in which is stored a program for
executing the following processes: (a) an analyzing process for
converting an input voice signal to an input pattern including cepstrum;
(b) an elongation/contraction estimating process for outputting an
elongation/contraction parameter in the frequency axis direction by
using the input pattern and reference patterns stored in a reference
pattern memory; and (c) a converting process for converting the input
pattern by using the elongation/contraction parameter.

According to a sixth aspect of the present invention, there is
provided a recording medium for a computer constituting a system for
voice recognition by executing elongation or contraction of the spectrum
of a voice signal on the frequency axis, in which is stored a program
for executing the following processes: (a) an analyzing process for
converting an input voice signal to an input pattern including cepstrum;
(b) an elongation/contraction estimating process for outputting an
elongation/contraction parameter in the frequency axis direction by
using the input pattern and reference patterns stored in a reference
pattern memory; (c) a converting process for converting the input
pattern by using the elongation/contraction parameter; and (d) a
matching process for computing the distances between the elongated or
contracted input pattern and the reference patterns and outputting the
reference pattern corresponding to the shortest distance as result of
recognition.

The converting process executes the elongation or contraction of the
spectrum on the frequency axis, with a warping function defining the
form of elongation or contraction, by carrying out the elongation or
contraction in cepstrum space. The elongation/contraction estimating
process executes the elongation or contraction of the spectrum on the
frequency axis, with the warping function defining the form of
elongation or contraction, by using estimation derived from the best
likelihood estimation of HMM (hidden Markov model) in cepstrum space.

According to a seventh aspect of the present invention, there is
provided, in a computer constituting a system for learning reference
patterns from learning voice data, a recording medium in which is stored
a program for executing the following processes: (a) an analyzing
process for receiving learning voice data from a learning voice memory
with learning voice data stored therein and converting the received
learning voice data to an input pattern including cepstrum; (b) an
elongation/contraction estimating process for outputting an
elongation/contraction parameter in the frequency axis direction by
using the input pattern and the reference patterns stored in the
reference pattern memory; (c) a converting process for converting the
input pattern by using the elongation/contraction parameter; (d) a
reference pattern estimating process for updating the reference patterns
for the learning voice data by using the elongated or contracted pattern
fed out in the converting process and the reference patterns; and (e) a
likelihood judging process for calculating the distances between the
elongated or contracted input pattern after conversion in the converting
process and the reference patterns and monitoring changes in distance.

The converting process executes the elongation or contraction of the
spectrum on the frequency axis, with a warping function defining the
form of elongation or contraction, by carrying out the elongation or
contraction in cepstrum space. The elongation/contraction estimating
process executes the elongation or contraction of the spectrum on the
frequency axis, with the warping function defining the form of
elongation or contraction, by using estimation derived from the best
likelihood estimation of HMM (hidden Markov model) in cepstrum space.

According to an eighth aspect of the present invention, there is
provided a recording medium for a computer constituting a spectrum
conversion by executing elongation or contraction of the spectrum of a
voice signal on the frequency axis, in which is stored a program for
executing the following processes: (a) an analyzing process for
converting an input voice signal to an input pattern including cepstrum;
(b) an elongation/contraction estimating process for outputting an
elongation/contraction parameter in the frequency axis direction by
using the input pattern and reference patterns stored in a reference
pattern memory; (c) a converting process for converting the input
pattern by using the elongation/contraction parameter; and (d) an
inverse converting process for outputting a signal waveform in time
domain by inversely converting the time serial input pattern obtained
after the elongation/contraction supplied from the converter.

According to a ninth aspect of the present invention, there is
provided a spectrum converting method for elongating or contracting the
spectrum of a voice signal on the frequency axis, comprising: a first
step for converting an input voice signal to an input pattern including
cepstrum; a second step for outputting an elongation/contraction
parameter in the frequency axis direction by using the input pattern and
the reference patterns stored in a reference pattern memory; and a third
step for converting the input pattern by using the
elongation/contraction parameter.

According to a tenth aspect of the present invention, there is
provided a voice recognition method comprising: a first step for
converting an input voice signal to an input pattern including a
cepstrum; a second step for outputting an elongation/contraction
parameter in the frequency axis direction by using the input pattern and
reference patterns stored in a reference pattern memory; a third step
for converting the input pattern by using the elongation/contraction
parameter; and a fourth step for computing the distances between the
elongated or contracted input pattern and the reference patterns and
outputting the reference pattern corresponding to the shortest distance
as result of recognition.

The elongation or contraction of the spectrum on the frequency axis,
with a warping function defining the form of elongation or contraction,
is executed by carrying out the elongation or contraction in cepstrum
space. The elongation/contraction estimating process executes the
elongation or contraction of the spectrum on the frequency axis, with
the warping function defining the form of elongation or contraction, by
using estimation derived from the best likelihood estimation of HMM
(hidden Markov model) in cepstrum space.

According to an eleventh aspect of the present invention, there is
provided a reference pattern learning method comprising: a first step
for receiving a learning voice signal from the learning voice memory and
converting the learning voice signal to an input pattern including
cepstrum; a second step for outputting an elongation/contraction
parameter in the frequency axis direction by using the input pattern and
the reference patterns stored in a reference pattern memory; a third
step for converting the input pattern by using the
elongation/contraction parameter; a fourth step for updating the
reference patterns for the learning voice data by using the elongated or
contracted input pattern and the reference patterns; and a fifth step
for monitoring distance changes by computing distances by using the
elongated or contracted input pattern and the reference patterns.

The third step executes the elongation or contraction of the spectrum
on the frequency axis, with a warping function defining the form of
elongation or contraction, by carrying out the elongation or contraction
in cepstrum space. The second step executes the elongation or
contraction of the spectrum on the frequency axis, with the warping
function defining the form of elongation or contraction, by using
estimation derived from the best likelihood estimation of HMM (hidden
Markov model) in cepstrum space.

According to a twelfth aspect of the present invention, there is
provided a voice recognition method of spectrum conversion to convert
the spectrum of a voice signal by executing elongation or contraction of
the spectrum on the frequency axis, wherein: the spectrum elongation or
contraction of the input voice signal as defined by a warping function
is executed on cepstrum, the extent of elongation or contraction of the
spectrum on the frequency axis is determined with an
elongation/contraction parameter included in the warping function, and
an optimum value is determined as the elongation/contraction parameter
value for each speaker.
Other objects and features will be clarified from the following
description with reference to the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a view showing the construction of a spectrum converter in
a first embodiment of the voice recognition system according to the
present invention;

Fig. 2 is a flow chart for explaining the process in the first
embodiment of the present invention;

Fig. 3 is a view showing the construction of the second embodiment of
the present invention;

Fig. 4 is a flow chart for describing the process sequence in the
second embodiment of the present invention;

Fig. 5 is a view showing the construction of the third embodiment of
the present invention;

Fig. 6 is a flow chart for describing the process in the third
embodiment of the present invention;

Fig. 7 is a view showing the construction of the fourth embodiment of
the present invention;

Fig. 8 is a view showing the construction of the fifth embodiment of
the present invention;

Fig. 9 is a view showing an example of the construction of the
spectrum converter in the prior art voice recognition system; and

Fig. 10 is a flow chart for describing a process executed in a prior
art spectrum matching unit.

PREFERRED EMBODIMENTS OF THE INVENTION
Embodiments of the present invention will now be described in detail
with reference to the drawings.

A system according to the present invention generally comprises an
analyzer unit 1 for converting an input voice signal to an input pattern
containing cepstrum, an elongation/contraction estimating unit 3 for
outputting an elongation/contraction parameter in the frequency axis
direction by using an input pattern and a reference pattern, and a
converter unit 2 for converting an input pattern by using an
elongation/contraction parameter.

The system further comprises a matching unit (i.e., recognizing unit
101) for calculating the distance between the input pattern converted by
the converter 2 and reference patterns and outputting the reference
pattern corresponding to the shortest distance as result of
recognition.
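
The shortest-distance decision made by the matching unit can be illustrated with a small sketch. It is a simplification under stated assumptions: each pattern is a single fixed-length feature vector and the distance is Euclidean, whereas the system described here matches time series by DP matching or HMM likelihood. The name `recognize` and the example labels are hypothetical.

```python
import math


def recognize(input_pattern, reference_patterns):
    """Return (label, distance) of the reference pattern closest to the input.

    reference_patterns maps a label (for example a word or phoneme name) to a
    feature vector of the same length as input_pattern. Euclidean distance is
    an assumption made for this sketch only.
    """
    best_label = None
    best_distance = math.inf
    for label, reference in reference_patterns.items():
        d = math.sqrt(sum((x - r) ** 2 for x, r in zip(input_pattern, reference)))
        if d < best_distance:       # keep the shortest distance seen so far
            best_label, best_distance = label, d
    return best_label, best_distance


# Hypothetical two-dimensional reference patterns.
references = {"a": [1.0, 0.0], "i": [0.0, 1.0]}
label, d = recognize([0.9, 0.2], references)
print(label)
```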

The elongation/contraction estimating unit 3 estimates an
elongation/contraction parameter by using a cepstrum contained in the
input pattern. Thus, according to the present invention it is not
necessary to store various values in advance when determining the
elongation/contraction parameter. Nor is it necessary to execute
distance calculation in connection with various values.

Furthermore, the system according to the present invention comprises
a learning voice memory 201 for storing learning voices, an analyzer 1
for receiving the learning voice data from the learning voice memory 201
and converting the received data to an input pattern including cepstrum,
a reference pattern memory 4 for storing reference patterns, an
elongation/contraction estimating unit 3 for outputting an
elongation/contraction parameter in the frequency axis direction by
using the input pattern and the reference pattern, a converter 2 for
converting an input pattern by using the elongation/contraction
parameter, a reference pattern estimating unit 202 for updating the
reference pattern for the learning voice by utilizing the input pattern
after elongation or contraction fed out from the converter and the
reference patterns, and a likelihood judging unit 203 for computing the
distance by utilizing the input pattern after elongation or contraction
and the reference patterns and monitoring changes in the distance.

Fig. 1 is a view showing the construction of a spectrum converter in
a first embodiment of the voice recognition system according to the
present invention. Referring to Fig. 1, the spectrum converter in the
first embodiment of the voice recognition system comprises an analyzer
1, a converter 2, an elongation/contraction estimating unit 3 and a
reference pattern memory 4.

The analyzer 1 cuts out a voice signal at every 
predetermined interval of time, obtains the spectrum 
component of the cut-out signal by using FFT (Fast Fourier 
Transform) or LPC (Linear Predictive Coding) analysis, 
converts the spectrum to the mel scale, which takes the 
human auditory sense into account, obtains a melcepstrum 
that extracts the envelope component of the resulting 
melspectrum, and feeds out the melcepstrum, the change 
therein, the change in the change, etc. as the input pattern. 
The converter 2 executes elongation or contraction of 
frequency by converting the melcepstrum in the input 
pattern. An example of conversion executed in the 
converter 2 will now be described in detail. 

According to Oppenheim, "Discrete Representation of 
Signals", Proc. IEEE, 60, 681-691, June 1972 (Literature 
4), the frequency conversion with a primary (first-order) 
all-pass filter as represented by Formula (1) given 
below can be expressed by Formula (2) as a recursive 
expression using the cepstrum (symbol c, with subscripts 
being the dimension numbers of the cepstrum). 



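$$
\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} \qquad (1)
$$

$$
\tilde{c}_m^{(i)} =
\begin{cases}
c_{-i} + \alpha\,\tilde{c}_0^{(i-1)}, & m = 0 \\
(1 - \alpha^2)\,\tilde{c}_{m-1}^{(i-1)} + \alpha\,\tilde{c}_m^{(i-1)}, & m = 1 \\
\tilde{c}_{m-1}^{(i-1)} + \alpha\left(\tilde{c}_m^{(i-1)} - \tilde{c}_{m-1}^{(i)}\right), & m \ge 2
\end{cases}
\qquad i = -\infty, \ldots, -1, 0 \qquad (2)
$$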



The conversion in the cepstrum space given by 
Formula (2) is equivalent to the frequency conversion of 
the spectrum given by Formula (1). Accordingly, the 
converter 2 executes elongation or contraction of the 
spectrum frequency without direct use of the spectrum, by 
executing the conversion given by Formula (2), derived 
from Formula (1), on the input pattern, with Formula (1) 
as the warping function and with α in Formula (1) as the 
elongation/contraction parameter. The input pattern 
obtained after the conversion is fed out as the converted 
input pattern. 
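
The cepstrum-domain conversion described above can be illustrated with a minimal sketch. It assumes the recursion takes the commonly used first-order all-pass form (m = 0, m = 1 and m ≥ 2 cases) and that the input cepstrum has finite order. The function names `warp_cepstrum` and `first_order_warp` are hypothetical and are not taken from the specification.

```python
import numpy as np


def warp_cepstrum(c, alpha, out_order=None):
    """Warp a cepstrum vector with the first-order all-pass recursion.

    c         : input cepstrum c[0..M] (finite order, assumed)
    alpha     : elongation/contraction parameter
    out_order : order of the warped cepstrum (defaults to the input order)
    """
    c = np.asarray(c, dtype=float)
    in_order = len(c) - 1
    if out_order is None:
        out_order = in_order
    g = np.zeros(out_order + 1)
    # Feed the input coefficients from the highest order down to c[0];
    # this corresponds to i = -in_order, ..., -1, 0 in the recursion.
    for k in range(in_order, -1, -1):
        d = g.copy()  # values from the previous recursion step (i - 1)
        g[0] = c[k] + alpha * d[0]
        if out_order >= 1:
            g[1] = (1.0 - alpha ** 2) * d[0] + alpha * d[1]
        for m in range(2, out_order + 1):
            g[m] = d[m - 1] + alpha * (d[m] - g[m - 1])
    return g


def first_order_warp(c, alpha):
    """First-degree approximation in alpha of the same conversion."""
    c = np.asarray(c, dtype=float)
    m = np.arange(len(c))
    c_next = np.append(c[1:], 0.0)      # c[m+1], zero beyond the last order
    c_prev = np.insert(c[:-1], 0, 0.0)  # c[m-1], zero for m = 0
    return c + alpha * ((m + 1) * c_next - (m - 1) * c_prev)


if __name__ == "__main__":
    c = np.array([1.0, 0.5, -0.2, 0.1, 0.05])
    print(warp_cepstrum(c, 0.0))      # alpha = 0 leaves the cepstrum unchanged
    print(warp_cepstrum(c, 0.01))
    print(first_order_warp(c, 0.01))  # close to the line above for small alpha
```

For small values of the parameter the two functions agree closely, which is what makes a first-degree approximation usable for parameter estimation.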

Reference patterns are stored in the reference 
pattern memory 4. The reference patterns may be 
hidden Markov models (HMMs) or time series reference 
patterns, such as phoneme time series, serving 
as phonetic data in units of words or phonemes. In this 
embodiment, the reference patterns are HMMs. Data 
constituting an HMM may be the average vector of a 
continuous Gaussian distribution, the variance, the 
inter-state transition probability, etc. 

The elongation/contraction estimating unit (also 
referred to as elongation/contraction parameter 
estimating unit) 3 obtains the alignment of the input 
pattern by using the HMM corresponding to the voice signal 
inputted to the analyzer 1. The term "alignment" 
means the posterior probability (post-probability) at 
each instant and in each state of the HMM. 

The alignment may be obtained by using a 
well-known method such as the Viterbi algorithm or the 
forward/backward algorithm described in "Fundamentals 
of Voice Recognition (Part II)", translated and edited 
by Furui, NTT Advanced Technology Co., Ltd., 1995, pp. 
102-185 (Literature 5). 

The elongation/contraction parameter is 
calculated by using the obtained alignment, the HMM and 
the input pattern. The elongation/contraction 
parameter is calculated by using Formula (4). 

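$$
\begin{aligned}
\tilde{c}_2 &= c_2 + \alpha(-c_1 + 3c_3) + \alpha^2(-4c_2 + 6c_4) + \alpha^3(c_1 - 9c_3 + 10c_5) + \cdots \\
\tilde{c}_3 &= c_3 + \alpha(-2c_2 + 4c_4) + \alpha^2(c_1 - 9c_3 + 10c_5) + \alpha^3(6c_2 - 24c_4 + 20c_6) + \cdots
\end{aligned}
\qquad (3)
$$

$$
\tilde{c}_m =
\begin{cases}
c_m + (m+1)\,c_{m+1}\,\alpha, & m = 0 \\
c_m + \{(m+1)\,c_{m+1} - (m-1)\,c_{m-1}\}\,\alpha, & m > 0
\end{cases}
\qquad (4)
$$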



Formula (4) is derived by expanding the recursive 
equation of Formula (2) with respect to the 
elongation/contraction parameter as in Formula (3), 
approximating the expansion with the first 
degree term of α, introducing the result into the Q 
function of HMM for likelihood estimation as described in 
Literature 4, and maximizing the Q function. 

The function thus derived is given by Formula (5). 
In Formula (5), c represents the melcepstrum part 
of the above input pattern, μ represents the average 
vector of the HMM, σ represents the variance of the HMM, 
and γ represents the posterior probability at instant t, 
in state j and in mixture component k as alignment data. 
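
As an illustration of how such an estimate can be computed in closed form, the following sketch maximizes a Gaussian Q function under the first-degree approximation of the conversion. It is a derivation under stated assumptions, not a transcription of Formula (5): diagonal covariances are assumed, the mixture index k is folded into the state index, and any normalization term arising from the transform is ignored. The function names are hypothetical.

```python
import numpy as np


def warp_direction(c):
    """First-degree term of the conversion for each frame.

    c : array (T, D) of cepstra; returns d with
        d[t, m] = (m + 1) * c[t, m + 1] - (m - 1) * c[t, m - 1],
    where out-of-range coefficients are treated as zero.
    """
    c = np.asarray(c, dtype=float)
    m = np.arange(c.shape[1])
    c_next = np.pad(c[:, 1:], ((0, 0), (0, 1)))   # c[t, m+1]
    c_prev = np.pad(c[:, :-1], ((0, 0), (1, 0)))  # c[t, m-1]
    return (m + 1) * c_next - (m - 1) * c_prev


def estimate_alpha(c, gamma, mean, var):
    """Closed-form alpha maximizing the approximated Q function.

    c     : (T, D) input cepstra
    gamma : (T, S) posterior probability of state s at instant t
    mean  : (S, D) average vectors of the states
    var   : (S, D) diagonal variances of the states
    """
    d = warp_direction(c)
    resid = mean[None, :, :] - c[:, None, :]  # (T, S, D)
    w = gamma[:, :, None] / var[None, :, :]   # posterior / variance weights
    num = np.sum(w * d[:, None, :] * resid)
    den = np.sum(w * d[:, None, :] ** 2)
    return num / den


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c = rng.normal(size=(5, 6))
    gamma = np.eye(5)                        # one state per frame, for the demo
    mean = c + 0.05 * warp_direction(c)      # model matches data warped by 0.05
    print(estimate_alpha(c, gamma, mean, np.ones((5, 6))))
```

Because the approximated conversion is linear in the parameter, the estimate needs a single pass over the frames and no search over candidate values.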

The posterior probability is the presence probability 
at a certain instant and in a certain state in the case of 
the forward/backward algorithm, and in the case of the 
Viterbi algorithm it is "1" in the case of presence on 
the optimum route at a certain instant and in a certain 
state and "0" otherwise. 
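
The difference between the two kinds of alignment can be sketched as follows. This is a minimal illustration assuming a discrete-state model given by per-frame observation likelihoods, a transition matrix and an initial distribution; the function names are hypothetical, and the Viterbi path is assumed to have been computed elsewhere.

```python
import numpy as np


def hard_alignment(path, n_states):
    """Viterbi-style alignment: 1 on the optimum route, 0 otherwise."""
    gamma = np.zeros((len(path), n_states))
    gamma[np.arange(len(path)), path] = 1.0
    return gamma


def soft_alignment(obs_lik, trans, init):
    """Forward/backward alignment: state occupancy probability per frame.

    obs_lik : (T, S) likelihood of frame t under state s
    trans   : (S, S) transition probabilities, trans[i, j] = P(j | i)
    init    : (S,)   initial state distribution
    """
    obs_lik = np.asarray(obs_lik, dtype=float)
    T, S = obs_lik.shape
    fwd = np.zeros((T, S))
    bwd = np.ones((T, S))
    fwd[0] = init * obs_lik[0]
    fwd[0] /= fwd[0].sum()
    for t in range(1, T):
        fwd[t] = (fwd[t - 1] @ trans) * obs_lik[t]
        fwd[t] /= fwd[t].sum()   # rescale to avoid numerical underflow
    for t in range(T - 2, -1, -1):
        bwd[t] = trans @ (obs_lik[t + 1] * bwd[t + 1])
        bwd[t] /= bwd[t].sum()
    gamma = fwd * bwd
    return gamma / gamma.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    trans = np.array([[0.5, 0.5], [0.0, 1.0]])
    init = np.array([1.0, 0.0])
    obs_lik = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
    print(soft_alignment(obs_lik, trans, init))
    print(hard_alignment([0, 0, 1], 2))
```

Either array can serve as the alignment data when estimating the elongation/contraction parameter; the hard alignment simply restricts the sums to the frames and states on the optimum route.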

While Formula (1) was given as the warping function 
in this embodiment, it is by no means limitative, and 
according to the present invention it is possible to adopt 
any formula. Also, while the first degree approximation 
of Formula (2) was used to derive Formula (5), it is also 
possible to use second and higher degree approximations. 

Fig. 2 is a flow chart for explaining the process 
in the first embodiment of the present invention. The 
overall operation of the first embodiment will now be 
described in detail with reference to Figs. 1 and 2. 
Subsequent to the input of a voice signal (step A101 in 
Fig. 2), the analyzer 1 calculates the input pattern 
(step A102). Then, the elongation/contraction estimating 
unit 3 calculates the elongation/contraction parameter by 
using the input pattern fed out from the analyzer 1 and 
the inputted HMM (step A105) (step A103). Then, the 
converter 2 obtains the converted input pattern from the 
input pattern fed out from the analyzer 1 by using the 
conversion function of any one of Formulas (2) to (4) 
(step A104). The value of α is "0" in the case of the 
first utterance, while values fed out from the 
elongation/contraction estimating unit 3 are used as α 
in the cases of the second and following utterances. 

The first embodiment of the present invention has 
the following effects. In the first embodiment, the 
input pattern fed out from the analyzer 1 is inputted 
to the converter 2, and the spectrum frequency elongation 
and contraction may be executed in the melcepstrum domain. 
Where Formula (5) is used, the repeated calculation 
described before in connection with the prior art is 
unnecessary, and analysis and other processes need be 
executed only once. It is thus possible to reduce the 
computational effort for the elongation/contraction 
parameter estimation. 

A second embodiment of the present invention will 
now be described. Fig. 3 is a view showing the 
construction of the second embodiment of the present 
invention. The second embodiment of the voice 
recognition system comprises an analyzer 1, a converter 
2, an elongation/contraction estimating unit 3, a 
recognizing unit 101 and a reference pattern memory 4. 
The analyzer 1, the converter 2, the elongation/contraction 
estimating unit 3 and the reference pattern memory 4 are the 
same as those described before in the description of the 
first embodiment. Specifically, like the first 
embodiment, the analyzer 1 analyzes the voice signal, 
and then calculates and feeds out the input pattern. 
Also like the first embodiment, the converter 2 converts 
the input pattern, and feeds out the converted input 
pattern. Furthermore, like the first embodiment, HMMs 
constituted by the average vector of the input pattern, 
variance, etc. are stored as elements representing 
phonemes in the reference pattern memory 4. 

The recognizing unit (or matching unit) 101 
executes recognition by checking which HMM is best 
matched to the converted input pattern fed out from the 
converter. The matching is executed by a well-known 
method such as the Viterbi algorithm or the forward/backward 
algorithm shown in Literature 5. 

Fig. 4 is a flow chart for describing the process 
sequence in the second embodiment of the present 
invention. Referring to Figs. 3 and 4, the overall 
operation of the second embodiment of the present 
invention will be described in detail. 

The analyzer 1 analyzes the input voice signal 
(step B101 in Fig. 4) and calculates the input pattern 
(step B102). The converter 2 obtains the converted 
input pattern from the input pattern fed out from the 
analyzer 1 by using the conversion function of any one of 
Formulas (2) to (4) (step B103). The value of α is "0" 
in the case of the first voice, while warping parameter 
values fed out from the elongation/contraction 
estimating unit 3 are used as α in the cases of the second 
and following voices. Then, the recognizing unit 101 
executes a recognizing process by using the converted 
input pattern (step B104). At this time, the HMM is inputted 
from the reference pattern memory 4 to the recognizing 
unit 101 (step B106). Subsequent to the recognizing 
process, the elongation/contraction estimating unit 3 
calculates the elongation/contraction parameter (step 
B105). Thereafter, the process is repeated from the voice 
input process in step B101 by using the 
elongation/contraction parameter obtained in step B105. 
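
The sequence of steps B101 to B105 can be sketched as a loop that carries the parameter from one utterance to the next. The sketch is a deliberately simplified stand-in: fixed-length templates with a frame-by-frame squared distance replace the HMM matching, and the first-degree conversion replaces the full recursion. The function names and the template dictionary are hypothetical.

```python
import numpy as np


def warp_direction(c):
    """First-degree term: (m + 1) * c[t, m + 1] - (m - 1) * c[t, m - 1]."""
    c = np.asarray(c, dtype=float)
    m = np.arange(c.shape[1])
    c_next = np.pad(c[:, 1:], ((0, 0), (0, 1)))
    c_prev = np.pad(c[:, :-1], ((0, 0), (1, 0)))
    return (m + 1) * c_next - (m - 1) * c_prev


def recognize_stream(utterances, templates):
    """Recognize utterances in order, updating alpha after each one.

    utterances : iterable of (T, D) cepstral input patterns
    templates  : dict mapping a label to a (T, D) reference pattern
                 (a simplification assumed here in place of HMMs)
    """
    alpha = 0.0  # the first utterance is converted with alpha = 0
    results = []
    for c in utterances:
        c = np.asarray(c, dtype=float)
        d = warp_direction(c)
        warped = c + alpha * d                        # conversion (B103)
        dists = {name: float(np.sum((warped - ref) ** 2))
                 for name, ref in templates.items()}
        best = min(dists, key=dists.get)              # recognition (B104)
        ref = templates[best]
        # parameter estimation against the recognized pattern (B105)
        alpha = float(np.sum(d * (ref - c)) / np.sum(d * d))
        results.append((best, alpha))
    return results


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    templates = {"a": rng.normal(size=(4, 5)), "b": rng.normal(size=(4, 5))}
    noisy = templates["a"] + 0.01 * rng.normal(size=(4, 5))
    print(recognize_stream([noisy, templates["b"]], templates))
```

The value estimated after one utterance is applied to the next, so the correction improves as more speech from the same speaker is observed.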

The second embodiment has the following functional 
effect. The second embodiment of the present invention 
comprises the spectrum converter 100 of the first 
embodiment and the recognizing unit 101. Thus, whenever 
a voice signal is inputted, the value of the 
elongation/contraction parameter is updated, and it is 
possible to correct frequency deviation with respect to 
the reference pattern. The recognition performance is 
thus improved. 

In addition, in the second embodiment of the 
present invention the elongation/contraction parameter 
estimation is executed by using Formula (5) for 
maximizing the Q function of the HMM maximum likelihood 
estimation. Thus, the elongation/contraction parameter 
can be obtained as a continuous value, and it is thus 
possible to expect recognition performance improvement 
compared to the case of using preliminarily prepared 
discrete values. 

A third embodiment of the present invention will 
now be described. Fig. 5 is a view showing the 
construction of the third embodiment of the present 
invention. Referring to Fig. 5, in the third embodiment 
the present invention is applied to a pattern learning 
system, which comprises a learning voice memory 201, a 
reference pattern estimating unit 202 and a likelihood 
judging unit 203 in addition to the spectrum converter 
100 in the first embodiment. 

The learning voice memory 201 stores voice signals 
used for learning HMM. The reference pattern estimating 
unit 202 estimates HMM parameters by using the converted 
input pattern fed out from the spectrum converter 100 and 
HMM. The estimation may be best likelihood estimation as 
described in Literature 4. The likelihood judging unit 
203 obtains distances corresponding to all learning 
voice signals by using the converted input pattern fed 
out from the spectrum converter 100 and HMM. Where the 
reference patterns are HMMs, the distance 
is obtained by using such a method as the Viterbi algorithm 
or the forward/backward algorithm as described in 
Literature 5. 

While the third embodiment of the present invention 
has been described in connection with the learning of 
HMM, the present invention is applicable to the learning 
of any parameter concerning voice recognition. 

Fig. 6 is a flow chart for describing the process 
in the third embodiment of the present invention. The 
entire operation of the third embodiment of the present 
invention will now be described in detail with reference 
to Figs. 5 and 6. First, a learning voice signal is 
inputted to the analyzer 1 in the spectrum 
converter 100 (step C101 in Fig. 6). The analyzer 1 
analyzes the learning voice signal and feeds out an input 
pattern (step C102). The elongation/contraction 
estimating unit 3 estimates the elongation/contraction 
parameter (step C103). The converter 2 executes input 
pattern conversion and feeds out a converted input 
pattern (step C104). The reference pattern estimating 
unit 202 executes HMM estimation by using the converted 
input pattern and HMM (step C105). The likelihood 
judging unit 203 obtains the likelihood corresponding to 
all the voice signals, and compares the change in 
likelihood with a threshold (step C106). When the change 
in likelihood is less than the threshold, the reference 
pattern memory 4 is updated with the HMM estimated in the 
reference pattern estimating unit 202, thus bringing an 
end to the learning. When the change in likelihood is 
greater than the threshold, the likelihood judging unit 
203 updates the reference pattern memory 4 with the HMM 
estimated by the reference pattern estimating unit 202, 
and the sequence of processes is repeated from the 
learning voice data input process (step C101). 
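
The learning loop of steps C101 to C106 can be sketched with a toy model. The sketch assumes a single Gaussian mean with unit variance as the reference pattern in place of an HMM, one parameter per speaker, the first-degree conversion, and a summed squared distance as the monitored quantity. All names are hypothetical.

```python
import numpy as np


def warp_direction(c):
    """First-degree term: (m + 1) * c[t, m + 1] - (m - 1) * c[t, m - 1]."""
    c = np.asarray(c, dtype=float)
    m = np.arange(c.shape[1])
    c_next = np.pad(c[:, 1:], ((0, 0), (0, 1)))
    c_prev = np.pad(c[:, :-1], ((0, 0), (1, 0)))
    return (m + 1) * c_next - (m - 1) * c_prev


def train_reference(speakers, threshold=1e-6, max_iter=50):
    """Alternate parameter estimation and reference pattern update.

    speakers : list of (T, D) arrays, one per speaker
    Returns the learned mean vector and the distance after each pass.
    """
    mean = np.vstack(speakers).mean(axis=0)  # initial reference pattern
    history = []
    for _ in range(max_iter):
        warped_all = []
        for c in speakers:
            d = warp_direction(c)
            # per-speaker parameter estimate (C103)
            alpha = np.sum(d * (mean - c)) / np.sum(d * d)
            warped_all.append(c + alpha * d)          # conversion (C104)
        warped = np.vstack(warped_all)
        mean = warped.mean(axis=0)                    # re-estimation (C105)
        dist = float(np.sum((warped - mean) ** 2))    # monitored value (C106)
        if history and abs(history[-1] - dist) < threshold:
            history.append(dist)
            break
        history.append(dist)
    return mean, history


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    speakers = [rng.normal(size=(10, 6)) for _ in range(3)]
    mean, history = train_reference(speakers)
    print(len(history), history[0], history[-1])
```

Each pass first minimizes the distance over the per-speaker parameters and then over the mean, so in this toy setting the monitored distance does not increase from one pass to the next, and the loop stops once the change falls below the threshold.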

The third embodiment of the present invention has 
the following effects. In the third embodiment of the 
present invention, when learning a reference pattern 
obtained for each speaker after correction of the effects 
of frequency elongation and contraction with a warping 
function, the elongation/contraction parameter 
estimation can be executed during the learning process. 
Thus, it is possible to reduce the computational effort 
compared to the prior art. In addition, Formula (5) used 
for the elongation/contraction parameter estimation is 
derived by using the best likelihood of HMM, and like 
other HMM parameter estimation cases it can be readily 
adapted for use in the course of learning. 

A fourth embodiment of the present invention will 
now be described. Fig. 7 is a view showing the 
construction of the fourth embodiment of the present 
invention. Referring to Fig. 7, the fourth embodiment 
of the present invention comprises an inverse converter 
5 in addition to the construction of the first embodiment. 
The inverse converter 5 executes voice quality 
conversion by inversely converting the elongated or 
contracted input pattern time series fed out from the 
converter 2 and outputting a signal waveform in the time 
domain. 

A fifth embodiment of the present invention will 
now be described. Fig. 8 is a view showing the 
construction of the fifth embodiment of the present 
invention. In the fifth embodiment of the present 
invention, the systems of the above first to fourth 
embodiments are realized by program control executed with 
a computer. Referring to Fig. 8, in the case of realizing 
the processes in the analyzer 1, the converter 2 and the 
elongation/contraction estimating unit 3 shown in Fig. 
1 by executing a program on a computer 10, the program is 
loaded from a recording medium 14, such as a CD-ROM, DVD, 
FD or magnetic tape, via a recording medium accessing 
unit 13 into a main memory 12 of the computer 10, and is 
executed in a CPU 11. In the recording medium 14 is 
stored a program for executing, with the computer, an 
analysis process for converting an input voice signal 
to an input pattern including cepstrum, and an 
elongation/contraction estimating process for 
outputting an elongation/contraction parameter in the 
frequency axis direction by using the input pattern and 
the reference pattern stored in a reference pattern 
memory. 

Alternatively, it is possible to record a program 
for causing execution, with a computer, of a matching 
process of computing the distance between the input 
pattern fed out after elongation or contraction and each 
reference pattern and outputting the reference pattern 
corresponding to the shortest distance as the result of 
recognition. 

A program for causing execution, with the computer, 
of the matching process for the distance calculation 
between the input pattern after the 
elongation/contraction and the reference pattern, and 
for outputting the reference pattern having the minimum 
distance as a recognition result, may be recorded in the 
recording medium. 

As a different alternative, it is possible to store 
in the recording medium 14 a program for causing execution, 
with the computer, of an analysis process for converting 
learning voice data stored in a learning voice memory 
to an input pattern containing cepstrum, an 
elongation/contraction estimating process for outputting 
an elongation/contraction parameter in the frequency axis 
direction by using the input pattern and the reference 
pattern stored in a reference pattern memory, a 
converting process for converting the input pattern by 
using the elongation/contraction parameter, a reference 
pattern estimating process for updating the reference 
pattern with respect to the learning voice by using the 
elongated or contracted input pattern fed out after the 
converting process and the reference patterns, and a 
likelihood judging process of monitoring changes in 
distance by computing the distance through utilization 
of the elongated or contracted input pattern and the 
reference patterns. It will be seen that the second 
to fourth embodiments can be realized by like 
program control. It is also possible to download the 
program from a server (not shown) via a network or like 
transfer medium. In other words, any medium, including 
a communication medium, may be used as the recording 
medium so long as it can hold the program. 

As has been described in the foregoing, according 
to the present invention it is possible to obtain the 
following advantages. 

A first advantage is a reduction of the computational 
effort required for calculating the parameter that is 
optimum for recognition performance in the voice signal 
spectrum frequency elongation or contraction. This is so 
because the present invention exploits the fact that 
the conversion by a primary (first-order) all-pass or like 
filter process with respect to the frequency axis can be 
expressed as a power series of the elongation/contraction 
parameter in the cepstrum domain. Thus, when the series is 
approximated by a first degree function, the 
elongation/contraction parameter that optimizes the 
function for the best likelihood estimation can be 
written in a closed form ready to be used for calculation. 

A second advantage is to make it possible to 
estimate the elongation/contraction parameter 
simultaneously with other parameters at the time of the 
HMM learning. This is so because according to the 
present invention the function for calculating the 
elongation/contraction parameter is derived from the Q 
function for the best likelihood estimation in voice 
recognition. 

Changes in construction will occur to those skilled 
in the art and various apparently different modifications 
and embodiments may be made without departing from the 
scope of the present invention. The matter set forth in 
the foregoing description and accompanying drawings is 
offered by way of illustration only. It is therefore 
intended that the foregoing description be regarded as 
illustrative rather than limiting. 

What is claimed is: 

1. A voice recognition system comprising a spectrum 
converter for elongating or contracting the spectrum of 
a voice signal on the frequency axis, the spectrum 
converter including: 

an analyzer for converting an input voice signal to 
an input pattern including cepstrum; 

a reference pattern memory with reference patterns 
stored therein; 

an elongation/contraction estimating unit for 
outputting an elongation/contraction parameter in the 
frequency axis direction by using the input pattern and 
the reference patterns; and 

a converter for converting the input pattern by using 
the elongation/contraction parameter. 

2. A voice recognition system comprising: 

an analyzer for converting an input voice signal to 
an input pattern including a cepstrum; 

a reference pattern memory for storing reference 
patterns; 

an elongation/contraction estimating unit for 
outputting an elongation/contraction parameter in the 
frequency axis direction by using the input pattern and 
reference patterns; 

a converter for converting the input pattern by using 
the elongation/contraction parameter; and 

a matching unit for computing the distances between 
the elongated or contracted input pattern fed out from the 
converter and the reference patterns and outputting the 
reference pattern corresponding to the shortest distance 
as result of recognition. 

3. The voice recognition system according to claim 
1 or 2, wherein the converter executes the elongation or 
contraction of spectrum on frequency axis with warping 
function defining the form of elongation or contraction 
by carrying out the elongation or contraction in cepstrum 
space. 

4. The voice recognition system according to one 
of claims 1 to 3, wherein the elongation/contraction 
estimating unit executes the elongation or contraction of 
spectrum on frequency axis with warping function defining 
the form of elongation or contraction by using estimation 
derived from the best likelihood estimation of HMM (hidden 
Markov model) in cepstrum space. 

5. A reference pattern learning system comprising: 

a learning voice memory with learning voice data 
stored therein; 

an analyzer for receiving a learning voice signal 
from the learning voice memory and converting the learning 
voice signal to an input pattern including cepstrum; 

a reference pattern memory with reference patterns 
stored therein; 

an elongation/contraction estimating unit for 
outputting an elongation/contraction parameter in 
frequency axis direction by using the input pattern and 
the reference patterns; 

a converter for converting the input pattern by using 
the elongation/contraction parameter; 

a reference pattern estimating unit for updating the 
reference patterns stored in the reference pattern memory 
for the learning voice data by using the elongated or 
contracted input pattern fed out from the converter and 
the reference patterns; and 

a likelihood judging unit for monitoring distance 
changes by computing distances by using the elongated or 
contracted input pattern fed out from the converter and 
the reference patterns. 

6. The reference pattern learning system according 
to claim 5, wherein the converter executes the elongation 
or contraction of spectrum on frequency axis with warping 
function defining the form of elongation or contraction 
by carrying out the elongation or contraction in cepstrum 
space. 

7. The reference pattern learning system according 
to claim 5 or 6, wherein the elongation/contraction 
estimating unit executes the elongation or contraction of 
spectrum on frequency axis with warping function defining 
the form of elongation or contraction by using estimation 
derived from the best likelihood estimation of HMM (hidden 
Markov model) in cepstrum space. 

8. A voice quality converting system comprising: 

an analyzer for converting an input voice signal to 
an input pattern including a cepstrum; 

a reference pattern memory for storing reference 
patterns; 

an elongation/contraction estimating unit for 
outputting an elongation/contraction parameter in the 
frequency axis direction by using the input pattern and 
reference patterns; 

a converter for converting the input pattern by using 
the elongation/contraction parameter; and 

an inverse converter for outputting a signal 
waveform in time domain by inversely converting the time 
serial input pattern obtained after the 
elongation/contraction supplied from the converter. 

9. A recording medium for a computer constituting 
a spectrum converter by executing elongation or 
contraction of the spectrum of a voice signal on frequency 
axis, in which is stored a program for executing the 
following processes: 

(a) an analyzing process for converting an input 
voice signal to an input pattern including cepstrum, 

(b) an elongation/contraction estimating process 
for outputting an elongation/contraction parameter in 
frequency axis direction by using the input pattern and 
reference patterns stored in a reference pattern memory; 
and 

(c) a converting process for converting the input 
pattern by using the elongation/contraction parameter. 

10. A recording medium for a computer constituting 
a system for voice recognition by executing elongation or 
contraction of the spectrum of a voice signal on frequency 
axis, in which is stored a program for executing the 
following processes: 

(a) an analyzing process for converting an input 
voice signal to an input pattern including cepstrum, 

(b) an elongation/contraction estimating process 
for outputting an elongation/contraction parameter in 
frequency axis direction by using the input pattern and 
reference patterns stored in a reference pattern memory; 

(c) a converting process for converting the input 
pattern by using the elongation/contraction parameter; 
and 

(d) a matching process for computing the distances 
between the elongated or contracted input pattern and the 
reference patterns and outputting the reference pattern 
corresponding to the shortest distance as result of 
recognition. 

11. The recording medium according to claim 10, 
wherein the converting process executes the elongation or 
contraction of spectrum on frequency axis with warping 
function defining the form of elongation or contraction 
by carrying out the elongation or contraction in cepstrum 
space. 

12. The recording medium according to claim 10, 
wherein the elongation/contraction estimating process 
executes the elongation or contraction of spectrum on 
frequency axis with warping function defining the form of 
elongation or contraction by using estimation derived from 
the best likelihood estimation of HMM (hidden Markov 
model) in cepstrum space. 

13. In a computer constituting a system for 
learning reference patterns from learning voice data, a 
recording medium, in which is stored a program, for 
executing the following processes: 

(a) an analyzing process for receiving learning 
voice data from learning voice memory with learning voice 
data stored therein and converting the received learning 
voice data to an input pattern including cepstrum; 

(b) an elongation/contraction estimating process 
for outputting an elongation/contraction parameter in 
frequency axis direction by using the input pattern and 
the reference patterns stored in the reference pattern 
memory; 

(c) a converting process for converting the input 
pattern by using the elongation/contraction parameter; 

(d) a reference pattern estimating process for 
updating the reference patterns for the learning voice 
data by using the elongated or contracted pattern fed out 
in the converting process and the reference patterns; and 

(e) a likelihood judging process for calculating the 
distances between the elongated or contracted input 
pattern after conversion in the converting process and the 
reference patterns and monitoring changes in distance. 

14. The recording medium according to claim 13, 
wherein the converting process executes the elongation or 
contraction of spectrum on frequency axis with warping 
function defining the form of elongation or contraction 
by carrying out the elongation or contraction in cepstrum 
space. 

15. The recording medium according to claim 13, 
wherein the elongation/contraction estimating process 
executes the elongation or contraction of spectrum on 
frequency axis with warping function defining the form of 
elongation or contraction by using estimation derived from 
the best likelihood estimation of HMM (hidden Markov 
model) in cepstrum space. 

16. A recording medium for a computer constituting 
a spectrum conversion by executing elongation or 
contraction of the spectrum of a voice signal on frequency 
axis, in which is stored a program for executing the 
following processes: 

(a) an analyzing process for converting an input 
voice signal to an input pattern including cepstrum, 

(b) an elongation/contraction estimating process 
for outputting an elongation/contraction parameter in 
frequency axis direction by using the input pattern and 
reference patterns stored in a reference pattern 
memory; 

(c) a converting process for converting the 
input pattern by using the elongation/contraction 
parameter; and 

(d) an inverse converting process for outputting a 
signal waveform in time domain by inversely converting the 
time serial input pattern obtained after the 
elongation/contraction supplied from the converter. 

17. A spectrum converting method for elongating or 
contracting the spectrum of a voice signal on the frequency 
axis, comprising: 

a first step for converting an input voice signal 
to an input pattern including cepstrum; 

a second step for outputting an 
elongation/contraction parameter in the frequency axis 
direction by using the input pattern and the reference 
patterns stored in a reference pattern memory; and 

a third step for converting the input pattern by 
using the elongation/contraction parameter. 

18. A voice recognition method comprising: 

a first step for converting an input voice signal 
to an input pattern including a cepstrum; 

a second step for outputting an 
elongation/contraction parameter in the frequency axis 
direction by using the input pattern and reference 
patterns stored in a reference pattern memory; 

a third step for converting the input pattern by 
using the elongation/contraction parameter; and 

a fourth step for computing the distances between 
the elongated or contracted input pattern and the 
reference patterns and outputting the reference pattern 
corresponding to the shortest distance as result of 
recognition. 

19. The voice recognition method according to claim 
17 or 18, wherein the elongation or contraction of spectrum 
on frequency axis with warping function defining the form 
of elongation or contraction is executed by carrying out 
the elongation or contraction in cepstrum space. 

20. The voice recognition method according to one 
of claims 17 to 19, wherein the elongation/contraction 
estimating process executes the elongation or contraction 
of spectrum on frequency axis with warping function 
defining the form of elongation or contraction by using 
estimation derived from the best likelihood estimation of 
HMM (hidden Markov model) in cepstrum space. 

21. A reference pattern learning method 
comprising: 

a first step for receiving a learning voice signal 
from the learning voice memory and converting the learning 
voice signal to an input pattern including cepstrum; 

a second step for outputting an 
elongation/contraction parameter in frequency axis 
direction by using the input pattern and the reference 
patterns stored in a reference pattern memory; 

a third step for converting the input pattern by 
using the elongation/contraction parameter; 

a fourth step for updating the reference patterns 
for the learning voice data by using the elongated or 
contracted input pattern and the reference patterns; and 

a fifth step for monitoring distance changes by 
computing distances by using the elongated or contracted 
input pattern and the reference patterns. 

22. The reference pattern learning method 
according to claim 21, wherein the third step executes the 
elongation or contraction of spectrum on frequency axis 
with warping function defining the form of elongation or 
contraction by carrying out the elongation or contraction 
in cepstrum space. 

23. The reference pattern learning method 
according to claim 21, wherein the second step executes 
the elongation or contraction of spectrum on frequency 
axis with warping function defining the form of elongation
or contraction by using estimation derived from the best
likelihood estimation of HMM (hidden Markov model) in
cepstrum space. 

24. A voice recognition method of spectrum 
conversion to convert the spectrum of a voice signal by 
executing elongation or contraction of the spectrum on 
frequency axis, wherein:

the spectrum elongation or contraction of the input 
voice signal as defined by a warping function is executed 
on cepstrum, the extent of elongation or contraction of 
the spectrum on the frequency axis is determined with 
elongation/contraction parameter included in warping 
function, and an optimum value is determined as 
elongation/contraction parameter value for each speaker. 

ABSTRACT OF THE DISCLOSURE

A voice recognition system comprises an analyzer 1
for converting an input voice signal to an input pattern
including cepstrum, a reference pattern memory 3 for
storing reference patterns, an elongation/contraction
estimating unit 4 for outputting an
elongation/contraction parameter in frequency axis
direction by using the input pattern and the reference
patterns, a converter 2 for converting the input pattern
by using the elongation/contraction parameter, and a
recognizing unit 101 for calculating the distances
between the converted input pattern from the converter 2
and the reference patterns and outputting the reference
pattern corresponding to the shortest distance as result
of recognition. The elongation/contraction estimating
unit 4 estimates the elongation/contraction parameter by
using cepstrum included in the input pattern. The unit 4
does not need to hold various values in advance for
determining the elongation/contraction parameter, nor
does the unit 4 have to execute distance calculation for
various values.
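
For illustration only, the conversion and recognition stages summarized above might be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the text here does not name the warping function, so a first-order all-pass (bilinear) warp applied through a well-known cepstral recursion is swapped in as one common choice; each pattern is reduced to a single cepstrum vector; the distance is plain Euclidean; and the parameter `alpha` is assumed to have been produced already by the estimating unit, whose estimation procedure is not shown. The names `warp_cepstrum` and `recognize` are hypothetical.

```python
import numpy as np


def warp_cepstrum(c, alpha, out_order=None):
    """Warp the frequency axis of a cepstrum vector c[0..M].

    Assumption: first-order all-pass (bilinear) warp with parameter alpha,
    computed by a recursion that works directly on cepstral coefficients.
    alpha = 0 leaves the cepstrum unchanged.
    """
    c = np.asarray(c, dtype=float)
    m2 = len(c) - 1 if out_order is None else out_order
    g = np.zeros(m2 + 1)
    # Feed the input coefficients from the highest order down to c[0].
    for ci in c[::-1]:
        d = g.copy()  # values from the previous pass
        g[0] = ci + alpha * d[0]
        if m2 >= 1:
            g[1] = (1.0 - alpha ** 2) * d[0] + alpha * d[1]
        for j in range(2, m2 + 1):
            # Uses the old g[j-1] and g[j] (in d) and the new g[j-1].
            g[j] = d[j - 1] + alpha * (d[j] - g[j - 1])
    return g


def recognize(input_cepstrum, alpha, references):
    """Convert the input pattern with alpha, then return the label of the
    reference pattern at the shortest (Euclidean, assumed) distance."""
    warped = warp_cepstrum(input_cepstrum, alpha)
    distances = {
        label: float(np.linalg.norm(warped - np.asarray(ref, dtype=float)))
        for label, ref in references.items()
    }
    best = min(distances, key=distances.get)
    return best, distances


if __name__ == "__main__":
    x = [1.0, 0.5, -0.2, 0.1]            # hypothetical input cepstrum
    refs = {
        "word_a": warp_cepstrum(x, 0.1),  # hypothetical reference patterns
        "word_b": [0.0, 0.0, 0.0, 0.0],
    }
    label, dists = recognize(x, 0.1, refs)
    print(label, dists)
```

In this sketch the warp is a linear operation on the cepstral coefficients, so no transform back to the spectrum is needed before the distance calculation.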